Add Querying Functionality to OSB #409
Adds random query workloads to both the train and no-train test procedures. Adds custom parameter source to produce the queries. Add usage of the parameter source to both json files. Updated documentation. Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds custom param source that will allow users to pull queries from a data set as opposed to using random queries. Along with this, refactored parameter sources to share common functionality. Updated README Signed-off-by: John Mazanec <jmazane@amazon.com>
Reads query vecs from data set in batches to avoid making too many disk reads. Batch size is hardcoded to 100. Signed-off-by: John Mazanec <jmazane@amazon.com>
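The batching described in this commit could look roughly like the sketch below. It is not the PR's code: `BatchedVectorReader` and its attributes are hypothetical names, and an in-memory sequence stands in for the on-disk data set. Only the batch size of 100 comes from the commit message.

```python
from typing import Iterator, List, Sequence

BATCH_SIZE = 100  # mirrors the hardcoded batch size mentioned in the commit message


class BatchedVectorReader:
    """Hypothetical helper: yields vectors one at a time, refilling a buffer per batch."""

    def __init__(self, data_set: Sequence[List[float]], batch_size: int = BATCH_SIZE):
        self.data_set = data_set      # assumption: stands in for an on-disk data set
        self.batch_size = batch_size
        self.offset = 0               # index of the next vector to load from the data set
        self.buffer: List[List[float]] = []
        self.buffer_pos = 0           # index of the next vector to serve from the buffer
        self.reads = 0                # number of batch reads issued so far

    def __iter__(self) -> Iterator[List[float]]:
        return self

    def __next__(self) -> List[float]:
        # Refill the buffer only when every buffered vector has been served.
        if self.buffer_pos >= len(self.buffer):
            if self.offset >= len(self.data_set):
                raise StopIteration
            end = min(self.offset + self.batch_size, len(self.data_set))
            self.buffer = list(self.data_set[self.offset:end])
            self.buffer_pos = 0
            self.offset = end
            self.reads += 1
        vector = self.buffer[self.buffer_pos]
        self.buffer_pos += 1
        return vector
```

With this layout, serving N query vectors costs about N / 100 reads against the data set instead of N.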
Add custom query recall runner so that we can eventually compute the recall of queries. Currently, recall value is hard coded but this will be implemented in the future. Signed-off-by: John Mazanec <jmazane@amazon.com>
Add ability to compute recall score for the custom query runner. Currently, to compute recall, it checks how many of the top k returned results appear in the ground truth set. Signed-off-by: John Mazanec <jmazane@amazon.com>
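A minimal sketch of the recall check described in this commit, for illustration only. The function name `recall_at_k` and the choice to normalize the hit count by k are assumptions; the commit message only says that the top k returned results are checked against the ground truth set.

```python
from typing import Iterable, Sequence


def recall_at_k(returned_ids: Sequence[str], ground_truth_ids: Iterable[str], k: int) -> float:
    """Fraction of the top-k returned ids that appear in the ground truth set.

    Hypothetical helper; dividing by k is an assumption, not taken from the PR.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    truth = set(ground_truth_ids)
    # Count how many of the first k results are true neighbors.
    hits = sum(1 for doc_id in returned_ids[:k] if doc_id in truth)
    return hits / k
```

For example, if three of the four returned ids are in the ground truth set, `recall_at_k(..., k=4)` gives 0.75.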
Cleans up documentation and tracks with addition of query and compute recall functionality. Signed-off-by: John Mazanec <jmazane@amazon.com>
Codecov Report
@@            Coverage Diff            @@
##               main     #409   +/-   ##
=========================================
  Coverage     84.01%   84.01%
  Complexity      911      911
=========================================
  Files           130      130
  Lines          3879     3879
  Branches        359      359
=========================================
  Hits           3259     3259
  Misses          458      458
  Partials        162      162
I haven't used this functionality myself but this seems like it should work. Is any of the data defined here showing up in the results? Is it just If you haven't already you could try configuring OpenSearch Benchmark to write to an OpenSearch cluster. That would give you access to the full set of raw metrics.
@travisbenedict I've tried a few variations of it, but no, only query latency, throughput, service time, and error rate get output. The Rally docs say that custom metrics are added to the metadata about the operation, but I am not sure how to find or generate those if it is not connected to an OpenSearch cluster. Also, ideally, I would like to get the results in the summary. Here is a current sample of the results: https://gist.github.com/jmazanec15/82b91eaad4af8acd773fbc97ba25b638.
Removes recall calculation from benchmarking logic as this is delayed until opensearch-project/opensearch-benchmark#199 can be implemented. Signed-off-by: John Mazanec <jmazane@amazon.com>
Removes random query. Random queries may be misleading if the distribution of the index data differs significantly from the distribution of the randomly generated queries. Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Adds unit tests for param sources for benchmarking. In addition, adds a test utility to create data sets dynamically. Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
Signed-off-by: John Mazanec <jmazane@amazon.com>
A few general questions:
- How are we going to handle OSB version updates? I think we use officially supported extension points, but I want to re-confirm that we minimize the chances of breaking things on our end when upgrading to a new OSB version.
- Do you want to use multiple clients for queries in our benchmarks (say, for a k-NN release)? We probably need to come up with a formula to estimate the number of clients based on the cluster configuration.
Returns:
The parameter source for this particular partition
"""
if self.num_vectors % total_partitions != 0:
Does this mean that the data set size must be divisible by the number of parallel clients?
If so, I think in the next revision we need to relax this requirement and divide the data evenly, except for the last client, which will also take the remainder.
Yes, that's a good point. I can update this in a future PR. I will create an issue when this PR is merged.
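The relaxed partitioning suggested in this thread could be sketched as follows. This is an illustration only, not code from the PR: `partition_bounds` is a hypothetical helper, and only `num_vectors` and `total_partitions` are names taken from the reviewed snippet.

```python
from typing import Tuple


def partition_bounds(num_vectors: int, total_partitions: int, partition_index: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) of vectors for one client.

    Hypothetical helper: every client gets num_vectors // total_partitions
    vectors, and the last client also takes the remainder.
    """
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    if not 0 <= partition_index < total_partitions:
        raise ValueError("partition_index out of range")
    base = num_vectors // total_partitions
    start = partition_index * base
    # The last partition extends to the end of the data set.
    end = num_vectors if partition_index == total_partitions - 1 else start + base
    return start, end
```

One trade-off of this scheme is that the last client can receive up to `total_partitions - 1` extra vectors; spreading the remainder one vector at a time across the first clients would balance the load more evenly.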
@martin-gaievski Good question. I think it will most likely be addressed at a later date. Right now, we don't release the benchmarks as artifacts, and we hard-code the dependency on OSB in requirements.txt. I think eventually we will want to move things to https://github.com/opensearch-project/opensearch-benchmark-workloads/, and when we do that we can ensure version compatibility.
Yes, I think this PR will focus more on providing the functionality of the benchmarking tool. In a future PR, we will make decisions on configuration. We need some kind of standardization of performance testing for releases.
Description
Adds the ability to run a query workload from a data set with the OpenSearch Benchmark tool for k-NN workloads. Refactors some of the code to better share components across extensions.
In addition, adds unit tests for the custom param sources.
For recall metrics, tracking issue here: opensearch-project/opensearch-benchmark#199. This will not be covered in this PR.
Issues Resolved
#373
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.